Introduce atomic slot migration #1591
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1591      +/-   ##
============================================
- Coverage     70.78%   70.58%    -0.21%
============================================
  Files           120      121        +1
  Lines         65046    65595      +549
============================================
+ Hits          46045    46302      +257
- Misses        19001    19293      +292
Do you have plans to implement the following 2 cases?
All this needs to do is call freeClient on the slot migration source client and delete all keys in the slot bitmap of that migration. The source is tracking the slot migration, and when the client close notification comes in, it frees its local tracking information. This is the same process that already occurs if a migration times out.
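For illustration only, here is a rough C sketch of that cleanup path. freeClient is the existing server call mentioned above; the job struct and the delKeysInSlot helper are assumptions made for this sketch, not code from this PR:

```c
#include <stdbool.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Simplified stand-ins; the real server structures are richer. */
typedef struct client client;
typedef struct slotExportJob {
    client *link;                           /* connection used for this migration (assumed field) */
    unsigned char slots[CLUSTER_SLOTS / 8]; /* bitmap of slots in this migration (assumed field)  */
} slotExportJob;

extern void freeClient(client *c);                    /* existing server function */
extern unsigned int delKeysInSlot(unsigned int slot); /* assumed helper: drops every key in a slot */

static bool slotIsSet(const unsigned char *bitmap, unsigned int slot) {
    return bitmap[slot / 8] & (1 << (slot % 8));
}

/* Cancel an in-flight migration: close the link and remove the partially
 * migrated keys, mirroring the cleanup already done on timeout. */
void cancelSlotMigration(slotExportJob *job) {
    if (job->link) freeClient(job->link);
    for (unsigned int slot = 0; slot < CLUSTER_SLOTS; slot++)
        if (slotIsSet(job->slots, slot)) delKeysInSlot(slot);
    memset(job->slots, 0, sizeof(job->slots));
}
```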
The implementation plan would be:
Note that, due to the consensus-less epoch bump, if there is a race between a failover and a slot migration, both may succeed, and one of them will later win deterministically via the epoch collision protocol. We will therefore lose some writes on the losing side of the epoch collision, because there is a period in which two nodes declare primaryship. This is the same issue that exists in the current consensus-less slot migration implementation, and we are looking to address it as part of #1355 instead.
Hi folks, so @enjoy-binbin, @PingXie, and I discussed offline. Tencent has offered a solution that they have developed internally. Moving forward, @enjoy-binbin and I will join efforts and work on bridging gaps in the Tencent solution to meet the requirements we outlined in #23. Hopefully we will have a shared PR for review soon. Given this, I am going to go ahead and close this pull request.
Starting this PR in draft so we can review early and gather alignment.
Summary
The solution is mostly based on discussion and proposals in #23: CLUSTER IMPORT is sent to the target and it syncs down the slot.

Import and Export Workflows
We utilize a new internal CLUSTER SYNCSLOTS command with additional sub-commands to transition through the slot migration state machine. On either end of the migration, we track the ongoing migrations via two job queues: slot_import_jobs and slot_export_jobs. Right now, we only support one concurrent job in each of the slot import and slot export job queues. There is no design restriction here (outside of perhaps some protocol additions to CLUSTER SYNCSLOTS), but it keeps the first iteration simple.
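As a rough illustration of that bookkeeping, one possible shape for the two queues is sketched below. The struct layout and names (other than slot_import_jobs and slot_export_jobs themselves) are invented for the sketch and are not taken from the PR:

```c
#define CLUSTER_SLOTS 16384

/* Illustrative only: one possible per-node tracking layout. */
typedef struct slotMigrationJob {
    unsigned char slots[CLUSTER_SLOTS / 8]; /* bitmap of slots covered by this job */
    int state;                              /* current CLUSTER SYNCSLOTS state     */
    struct slotMigrationJob *next;          /* next queued job                     */
} slotMigrationJob;

static slotMigrationJob *slot_import_jobs; /* this node is the target (importing) */
static slotMigrationJob *slot_export_jobs; /* this node is the source (exporting) */

/* First iteration: at most one active job per queue. */
static int canStartJob(const slotMigrationJob *queue) {
    return queue == NULL;
}
```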
Workflow overview
1. Operator sends CLUSTER IMPORT SLOTSRANGE <slot start> <slot end> ... to target node T.
2. T initiates a new connection to source node S.
3. T runs AUTH based on replication configuration.
4. T initiates CLUSTER SYNCSLOTS START to S.
5. S begins tracking the client that sent the command as the slot export client. It spawns a child process at the next available time and runs an AOF rewrite with just the specified slots. It then begins accumulating a backlog of writes in the slot export client output buffer, without installing the write handler. S also appends CLUSTER SYNCSLOTS ENDSNAPSHOT.
6. T processes the AOF rewrite as it would any other client using readQueryFromClient. Once it gets the CLUSTER SYNCSLOTS ENDSNAPSHOT, T sends back a CLUSTER SYNCSLOTS PAUSE to pause S.
7. S unblocks the slot export client to T, which has been accumulating ongoing writes. S then pauses itself and sends CLUSTER SYNCSLOTS PAUSEOFFSET <offset> back to T with the current offset. (Note that the offset is not the primary replication offset; it is actually a computed offset based on how much we have been accumulating on T's client.)
8. T waits for its replication offset to catch up to the sent offset, and once caught up executes the consensus-less bump.
9. S finds out about the bump via cluster gossip, unpauses itself, and cleans up dirty keys.

If at any point a client is disconnected on either end, or a timeout is reached on the target node, the migration is marked as failed. If a migration fails, we delete all keys in the slots we were migrating.
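To make the ordering above easier to follow, here is one way the target-side state machine could be written down. The state names are illustrative only and are not the PR's actual identifiers:

```c
/* Illustrative state machine for one import job on target node T, following
 * the step ordering above; names are made up for this sketch. */
typedef enum slotImportState {
    SLOT_IMPORT_CONNECTING,   /* connect to S and run AUTH                                */
    SLOT_IMPORT_SNAPSHOT,     /* sent CLUSTER SYNCSLOTS START; applying the filtered
                                 snapshot via readQueryFromClient                          */
    SLOT_IMPORT_PAUSING,      /* saw ENDSNAPSHOT; sent CLUSTER SYNCSLOTS PAUSE to S        */
    SLOT_IMPORT_CATCHING_UP,  /* draining buffered writes until the local offset reaches
                                 the value from CLUSTER SYNCSLOTS PAUSEOFFSET              */
    SLOT_IMPORT_FINALIZING,   /* offsets match; perform the consensus-less epoch bump      */
    SLOT_IMPORT_FAILED        /* disconnect or timeout; delete keys in the migrated slots  */
} slotImportState;
```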
Filtering traffic
We filter the traffic to the target node through a filtered AOF rewrite and a filtered replication stream. The filtered AOF rewrite requires some refactoring of the snapshot code for reuse, but utilizes the same overall procedure (piping through the parent process to the target node connection).
The filtered replication stream hooks into the existing replication code and appends the commands directly to the client output buffer. We don't use replication backlog as there is no easy way to filter it once added to the backlog without re-processing it to query for the slot number of each command.
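A minimal sketch of the per-command filtering idea follows. keyHashSlot is the existing slot hashing function; the job struct and the output-buffer append helper are invented for illustration and are not the PR's code:

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

struct client; /* opaque here; the real struct lives in the server */

typedef struct slotExportJob {
    struct client *link;                    /* connection to the target node (assumed field) */
    unsigned char slots[CLUSTER_SLOTS / 8]; /* bitmap of slots being exported (assumed field) */
} slotExportJob;

extern unsigned int keyHashSlot(char *key, int keylen);  /* existing slot hashing helper */
extern void appendToClientOutput(struct client *c,       /* assumed: raw append to the    */
                                 const char *buf, size_t len); /* client output buffer    */

static int jobOwnsSlot(const slotExportJob *job, unsigned int slot) {
    return job->slots[slot / 8] & (1 << (slot % 8));
}

/* Called from command propagation: forward a write only when its key hashes to
 * a slot covered by the export, appending directly to that client's output
 * buffer rather than to the shared replication backlog. */
void propagateToSlotExport(slotExportJob *job, char *key, size_t keylen,
                           const char *cmdbuf, size_t cmdlen) {
    if (jobOwnsSlot(job, keyHashSlot(key, (int)keylen)))
        appendToClientOutput(job->link, cmdbuf, cmdlen);
}
```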
We add a new check in putClientInPendingWriteQueue to prevent the two command streams from merging. The parent process will just accumulate the replication stream in the client output buffer until we get the notification that the target is done with the snapshot.
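A simplified sketch of that check, using stand-in types and an assumed flag name rather than the PR's actual fields:

```c
#include <stdbool.h>

/* Simplified stand-ins; field names here are assumptions for illustration. */
typedef struct client {
    bool is_slot_export_client;  /* assumed flag marking the slot export link */
    /* ... output buffer, write handler, pending-write node, etc. ... */
} client;

static struct { bool slot_export_snapshot_in_progress; } server; /* assumed field */

extern void queueClientForPendingWrites(client *c); /* stands in for the existing queueing logic */

/* While the snapshot child is still streaming, the slot export client keeps
 * buffering replicated writes but is never scheduled for writing, so the
 * snapshot and the live stream cannot interleave on the socket. */
void putClientInPendingWriteQueue(client *c) {
    if (c->is_slot_export_client && server.slot_export_snapshot_in_progress)
        return; /* keep accumulating; resume once the target consumes ENDSNAPSHOT */
    queueClientForPendingWrites(c);
}
```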
Other notes
- The replicated flag is used instead of the primary flag.

Remaining work items
- KEYS and RANDOMKEY
- CLUSTER IMPORT STATUS, CLUSTER IMPORT CANCEL